Exploratory data analysis on US presidential elections

By Poonam

================================================================================

This is an exploration of donations to the 2016 US presidential election campaigns from the state of California. For this analysis, I am exploring the 2016 presidential campaign finance data from the Federal Election Commission. The dataset contains individual financial contribution transactions.

Through my analysis, I will attempt to answer the following questions:

  • Which candidate received the most money?
  • Which political party received the most contributions?
  • What is the spread of occupation of those donors?
  • How did the flow of money change from the beginning to the end of the campaign?
  • Did Hillary Clinton receive more money than Donald Trump?
# Load all of the packages that will be used for analysis
library(readr)
library(ggplot2)
library(dplyr)
library(tidyr)
library(lubridate)
library(gridExtra)
library(plotly)
library(ggmap)
library(maps)
library(tidyverse)

Summarise the dataset, and check column names.

##    cmte_id            cand_id            cand_nm         
##  Length:1304346     Length:1304346     Length:1304346    
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   contbr_nm         contbr_city         contbr_st        
##  Length:1304346     Length:1304346     Length:1304346    
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    contbr_zip        contbr_employer    contbr_occupation 
##  Min.   :        0   Length:1304346     Length:1304346    
##  1st Qu.:910162321   Class :character   Class :character  
##  Median :930012752   Mode  :character   Mode  :character  
##  Mean   :850773217                                        
##  3rd Qu.:945981502                                        
##  Max.   :961628693                                        
##  NA's   :113                                              
##  contb_receipt_amt  contb_receipt_dt   receipt_desc      
##  Min.   :-10500.0   Length:1304346     Length:1304346    
##  1st Qu.:    15.0   Class :character   Class :character  
##  Median :    27.0   Mode  :character   Mode  :character  
##  Mean   :   116.2                                        
##  3rd Qu.:    88.0                                        
##  Max.   : 10800.0                                        
##                                                          
##    memo_cd           memo_text           form_tp         
##  Length:1304346     Length:1304346     Length:1304346    
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##     file_num         tran_id          election_tp       
##  Min.   :1003942   Length:1304346     Length:1304346    
##  1st Qu.:1077916   Class :character   Class :character  
##  Median :1099613   Mode  :character   Mode  :character  
##  Mean   :1102796                                        
##  3rd Qu.:1133832                                        
##  Max.   :1146285                                        
## 
##  [1] "cmte_id"           "cand_id"           "cand_nm"          
##  [4] "contbr_nm"         "contbr_city"       "contbr_st"        
##  [7] "contbr_zip"        "contbr_employer"   "contbr_occupation"
## [10] "contb_receipt_amt" "contb_receipt_dt"  "receipt_desc"     
## [13] "memo_cd"           "memo_text"         "form_tp"          
## [16] "file_num"          "tran_id"           "election_tp"

Univariate Plot Section

From the dataset summary we found that this dataset contains 1304346 observations and 18 variables. Let’s plot contribution graphs against different variables.

Let’s do some basic null checks, before we proceed.

There are no null values in the contribution amount column. Let’s plot a simple histogram.
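The code for the null check and the histogram is not shown in this report. A minimal sketch of how both could be done is given here; the toy data frame and the bin width are assumptions made for illustration, not the values used in the analysis.

```r
library(ggplot2)

# Toy stand-in for the full dataset (amounts are made up for illustration).
DF <- data.frame(contb_receipt_amt = c(15, 27, 88, 250, 1000, 2700, -40, 50))

# Count missing values in the contribution amount column.
na_amt <- sum(is.na(DF$contb_receipt_amt))
print(na_amt)

# Simple histogram of contribution amounts; binwidth = 100 is an arbitrary choice.
p_hist <- ggplot(DF, aes(x = contb_receipt_amt)) +
  geom_histogram(binwidth = 100) +
  labs(x = "Contribution amount (USD)", y = "Number of contributions")
```

Printing `p_hist` draws the plot. A narrower bin width or `coord_cartesian(xlim = ...)` could be used for the more detailed view.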

Plot another detailed contribution histogram.

This graph also shows that a large number of people contributed less than $500. A significant number of people also contributed between $2,500 and $3,000.

We can see from the above histogram that there are also negative values in the contribution amount. Let’s see whether these rows carry any other relevant details by checking their receipt description field.

## # A tibble: 16,313 x 9
##    cand_nm          contbr… contb… contbr… contb_… cont… rece… memo… elec…
##    <chr>            <chr>   <chr>  <chr>     <dbl> <chr> <chr> <chr> <chr>
##  1 Trump, Donald J. ROPPA,… ALISO… RETIRED  -40.0  21-A… <NA>  <NA>  G2016
##  2 Trump, Donald J. SHARP,… CARLS… RETIRED  - 4.00 06-S… <NA>  <NA>  G2016
##  3 Trump, Donald J. SHARP,… CARLS… RETIRED  - 4.00 06-O… <NA>  <NA>  G2016
##  4 Trump, Donald J. SHARP,… CARLS… RETIRED  -28.0  18-O… <NA>  <NA>  G2016
##  5 Trump, Donald J. SHARP,… CARLS… RETIRED  -28.0  25-O… <NA>  <NA>  G2016
##  6 Trump, Donald J. PAVLOV… CAMAR… UNEMPL…  -40.0  26-A… <NA>  <NA>  G2016
##  7 Trump, Donald J. SHARP,… CARLS… RETIRED  -28.0  01-N… <NA>  <NA>  G2016
##  8 Trump, Donald J. SHARP,… CARLS… RETIRED  - 4.00 06-N… <NA>  <NA>  G2016
##  9 Trump, Donald J. SHAW, … POMONA RETIRED  -20.0  16-O… <NA>  <NA>  G2016
## 10 Trump, Donald J. SHAW, … POMONA RETIRED  -20.0  23-O… <NA>  <NA>  G2016
## # ... with 16,303 more rows
## [1] 10412
## [1] 30
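The filtering step behind the output above is not shown. It could look roughly like the following sketch, which runs on a toy tibble. The candidate names and description strings are placeholders, and matching "refund" case-insensitively is an assumption about how the descriptions are written.

```r
library(dplyr)

# Toy data: candidate names and descriptions are placeholders.
DF <- tibble(
  cand_nm           = c("Candidate A", "Candidate A", "Candidate B",
                        "Candidate B", "Candidate B"),
  contb_receipt_amt = c(-40, 25, -4, -28, 100),
  receipt_desc      = c("Refund", NA, "REDESIGNATION TO GENERAL", "Refund", NA)
)

# Keep only the rows with a negative contribution amount.
DF_negative <- DF %>% filter(contb_receipt_amt < 0)

# Count negative rows whose description mentions a refund.
# grepl() returns FALSE for NA descriptions, so they are not counted.
n_refund <- sum(grepl("refund", DF_negative$receipt_desc, ignore.case = TRUE))
print(n_refund)
```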

Many of these records (10,412 out of 16,313) have a receipt description of “Refund”. It could be that the contributor changed their mind and asked for a refund at a later stage. The other two description categories are “Redesignation” and “Reattribution”. Let’s see how significant the sum of the negative amounts is relative to the total.

Let’s calculate the total sum of the positive values and of the negative values.

Check the number of rows containing a positive amount and the number containing a negative amount.

So out of 1,304,346 observations, 16,313 have negative amount values.
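As an illustration of this calculation, the sketch below computes the row counts and sums for positive and negative amounts in one `summarise()` call. The amounts are made up, and the column names of the result are hypothetical.

```r
library(dplyr)

# Toy amounts, made up for illustration.
DF <- tibble(contb_receipt_amt = c(100, 50, -40, 27, -10))

amt_summary <- DF %>%
  summarise(
    n_positive   = sum(contb_receipt_amt > 0),   # rows with a positive amount
    n_negative   = sum(contb_receipt_amt < 0),   # rows with a negative amount
    sum_positive = sum(contb_receipt_amt[contb_receipt_amt > 0]),
    sum_negative = sum(contb_receipt_amt[contb_receipt_amt < 0]),
    net_total    = sum(contb_receipt_amt)        # positives plus negatives
  )
amt_summary
```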

Let’s see how contributions arrived over time. We will plot a time series line chart for each party to see the trend of contributions received from the beginning to the end of the election period.

Let’s do a basic null check on the date before we proceed.

## [1] 0

There are no rows with a missing date. Check how many unique dates are present.

count_of_contb_dt <- length(unique(DF$contb_receipt_dt))
print(count_of_contb_dt)
## [1] 732

So the contribution distribution is spread across more than 2 years. Let’s group the contribution by month.
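The monthly grouping is not shown in the report. A minimal sketch follows, assuming the date strings look like `21-APR-16` (the dates in the output above are truncated, so this format is a guess) and using `Month_Yr` as the month column, with a `YYYY-MM` format chosen for illustration.

```r
library(dplyr)
library(lubridate)

# Toy data; the date format is an assumption.
DF <- tibble(
  contb_receipt_dt  = c("21-APR-16", "06-SEP-16", "18-OCT-16",
                        "25-OCT-16", "02-APR-16"),
  contb_receipt_amt = c(50, 200, 40, 35, 5)
)

DF_month <- DF %>%
  mutate(receipt_date = dmy(contb_receipt_dt),            # parse day-month-year
         Month_Yr     = format(receipt_date, "%Y-%m")) %>% # e.g. "2016-04"
  group_by(Month_Yr) %>%
  summarise(month_amt = sum(contb_receipt_amt), n = n()) %>%
  arrange(desc(month_amt))
DF_month
```

Sorting by `month_amt` puts the peak month first, which is how a "top 10" listing could be read off with `head(DF_month, 10)`.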

The top 10 dates of contribution show Dec 2015 being the peak month of contribution, followed by Sep 2015.

Let’s visualize through a graph.

Univariate Analysis

In the univariate section we explored the “Contribution Amount” variable. Most contributions fell in the range of 0 to 500 dollars, with slight peaks at 1,000 and 2,700 dollars. Most contribution values were positive, but some observations had negative values, described as Refund, Reattribution or Redesignation. For the purpose of the current exploration, the total amount is calculated as the sum of the positive amounts minus the sum of the refunded (negative) amounts.

There are 1,304,346 contributions and 18 variables. The variables that interest me, and that I have used, are:

  • cand_nm: Candidate name
  • contb_receipt_amt: Contribution amount
  • contbr_occupation: Contributor occupation
  • contbr_city: Contributor city
  • contb_receipt_dt: Contribution date
  • election_tp: Type of election (Primary, General)

Other observations:

Most people contribute small amounts of money. The median contribution amount is $27 and the mean contribution amount is $116. The amount of contribution is highest in Aug-Sep 2016, just before the election.

From the above graph we see that contributions peaked between Aug 2016 and Oct 2016, just before the election. We can also note that contributions started to pick up from Apr 2015.

Bivariate Plot Section

Next we will see how this total “contribution amount” is distributed w.r.t other factors like candidates, political parties, contributor’s occupation and contributor’s city.

Get the unique candidate names to see how many candidates ran in the election. Summarise the total contribution amount for each candidate.

count_of_nm <- length(unique(DF$cand_nm))
print(count_of_nm)
## [1] 25
print(unique(DF$cand_nm)) 
##  [1] "Clinton, Hillary Rodham"   "Trump, Donald J."         
##  [3] "Sanders, Bernard"          "O'Malley, Martin Joseph"  
##  [5] "Santorum, Richard J."      "Cruz, Rafael Edward 'Ted'"
##  [7] "Walker, Scott"             "Bush, Jeb"                
##  [9] "Rubio, Marco"              "Kasich, John R."          
## [11] "Christie, Christopher J."  "Johnson, Gary"            
## [13] "Paul, Rand"                "Webb, James Henry Jr."    
## [15] "Carson, Benjamin S."       "Fiorina, Carly"           
## [17] "Jindal, Bobby"             "Huckabee, Mike"           
## [19] "Lessig, Lawrence"          "Graham, Lindsey O."       
## [21] "Pataki, George E."         "Stein, Jill"              
## [23] "Perry, James R. (Rick)"    "McMullin, Evan"           
## [25] "Gilmore, James S III"

We see that there are 25 unique candidates. Plot a bar chart to see how much contribution amount each candidate received.

DF_cand_dist <- DF %>% 
  group_by(cand_nm) %>% 
  summarise(candidate_amt = sum(contb_receipt_amt, na.rm=TRUE),
            n= n())
DF_cand_dist
## # A tibble: 25 x 3
##    cand_nm                   candidate_amt      n
##    <chr>                             <dbl>  <int>
##  1 Bush, Jeb                       3300292   3130
##  2 Carson, Benjamin S.             2912555  27370
##  3 Christie, Christopher J.         456066    333
##  4 Clinton, Hillary Rodham        93681171 688524
##  5 Cruz, Rafael Edward 'Ted'       5730682  57822
##  6 Fiorina, Carly                  1450689   4706
##  7 Gilmore, James S III               8100      3
##  8 Graham, Lindsey O.               414495    347
##  9 Huckabee, Mike                   230891    531
## 10 Jindal, Bobby                     23231     31
## # ... with 15 more rows
DF_cand_dist <- head(arrange(DF_cand_dist, desc(candidate_amt)), n= 10)

Let’s plot the graph.
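The plotting code is not included in the report. One way to draw such a bar chart is sketched here, using four of the candidate totals printed in the table above. The choice of a horizontal layout and of millions of dollars as the unit is an assumption.

```r
library(ggplot2)

# Four candidate totals taken from the printed table; the real plot would
# use the full top-10 data frame.
DF_cand_dist <- data.frame(
  cand_nm       = c("Clinton, Hillary Rodham", "Cruz, Rafael Edward 'Ted'",
                    "Bush, Jeb", "Carson, Benjamin S."),
  candidate_amt = c(93681171, 5730682, 3300292, 2912555)
)

# reorder() sorts the bars by amount; coord_flip() keeps long names readable.
p_cand <- ggplot(DF_cand_dist,
                 aes(x = reorder(cand_nm, candidate_amt),
                     y = candidate_amt / 1e6)) +
  geom_col() +
  coord_flip() +
  labs(x = "Candidate", y = "Total contributions (million USD)")
```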

Check contributions by political party.

Draw a pie chart to see the distribution of the contribution amount received by each political party.

First, sum the contributions by party.

DF_party_dist <- DF %>% 
  group_by(Political_Party) %>% 
  summarise(party_amt = sum(contb_receipt_amt, na.rm=TRUE))

head(DF_party_dist)
## # A tibble: 4 x 2
##   Political_Party    party_amt
##   <chr>                  <dbl>
## 1 Democratic_Party   113491148
## 2 Green_Party_of_USA    751785
## 3 Libretarian_Party     495231
## 4 Republic_Party      36865654
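The pie chart itself is not reproduced here. A sketch of how it could be drawn from the party totals above follows. The totals are copied from the printed table, while the party labels are normalised spellings chosen for this example rather than the values stored in the data.

```r
library(ggplot2)

# Party totals from the printed table; labels are normalised for readability.
DF_party_dist <- data.frame(
  Political_Party = c("Democratic", "Green", "Libertarian", "Republican"),
  party_amt       = c(113491148, 751785, 495231, 36865654)
)

# Percentage share of the total contributed amount.
DF_party_dist$share_pct <-
  round(100 * DF_party_dist$party_amt / sum(DF_party_dist$party_amt), 1)

# ggplot2 has no pie geom: a single stacked bar in polar coordinates is a pie.
p_pie <- ggplot(DF_party_dist,
                aes(x = "", y = party_amt, fill = Political_Party)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void()
```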

Let’s see how many election types there are.

count_of_election_tp <- length(unique(DF$election_tp))
print(count_of_election_tp)
## [1] 5
print(unique(DF$election_tp))
## [1] "P2016" "G2016" NA      "P2020" "O2016"

We see that there are 5 unique election types. Let’s see contribution per election type.

DF_election_tp <- DF %>% 
  group_by(election_tp) %>% 
  summarise(sum_election_tp = sum(contb_receipt_amt, na.rm=TRUE),
                              mean_election_tp = mean(contb_receipt_amt), 
                              n = n())
DF_election_tp <- head(arrange(DF_election_tp,desc(sum_election_tp)))
head(DF_election_tp)
## # A tibble: 5 x 4
##   election_tp sum_election_tp mean_election_tp      n
##   <chr>                 <dbl>            <dbl>  <int>
## 1 P2016              93965415              115 818021
## 2 G2016              56931973              118 483991
## 3 O2016                453994              718    632
## 4 <NA>                 237435              140   1695
## 5 P2020                 15000             2143      7

Most of the contributions are for election type P2016, which appears to be the primary election. The election type with the next most contributions is G2016, which appears to be the general election. (Source of information: Wikipedia.)

## # A tibble: 818,021 x 9
##    cand_nm  contbr… contbr… contbr_… contb_r… contb… rece… memo_text elec…
##    <chr>    <chr>   <chr>   <chr>       <dbl> <chr>  <chr> <chr>     <chr>
##  1 Clinton… AULL, … LARKSP… RETIRED     50.0  26-AP… <NA>  * HILLAR… P2016
##  2 Clinton… CARROL… CAMBRIA RETIRED    200    20-AP… <NA>  * HILLAR… P2016
##  3 Clinton… GANDAR… FONTANA RETIRED      5.00 02-AP… <NA>  * HILLAR… P2016
##  4 Sanders… LEE, A… CAMARI… SOFTWAR…    40.0  04-MA… <NA>  * EARMAR… P2016
##  5 Sanders… LEONEL… REDOND… PHARMAC…    35.0  05-MA… <NA>  * EARMAR… P2016
##  6 Sanders… LEONEL… REDOND… PHARMAC…   100    06-MA… <NA>  * EARMAR… P2016
##  7 Sanders… LEOPAR… VISTA   PROJECT…    25.0  04-MA… <NA>  * EARMAR… P2016
##  8 Clinton… HOFER,… LAGUNA… RETIRED     40.0  20-AP… <NA>  * HILLAR… P2016
##  9 Sanders… LEPKE,… WESTMI… NOT EMP…    10.0  05-MA… <NA>  * EARMAR… P2016
## 10 Sanders… LERCH,… PETALU… DIRECTO…    15.0  06-MA… <NA>  * EARMAR… P2016
## # ... with 818,011 more rows
## # A tibble: 6 x 4
##   cand_nm                   sum_election_tp_p2016 mean_election_tp      n
##   <chr>                                     <dbl>            <dbl>  <int>
## 1 Clinton, Hillary Rodham                46434728            181   256294
## 2 Sanders, Bernard                       19623823             48.2 407163
## 3 Cruz, Rafael Edward 'Ted'               5900103            105    56402
## 4 Rubio, Marco                            4995681            376    13272
## 5 Trump, Donald J.                        4439995            113    39164
## 6 Bush, Jeb                               3317092           1085     3057
##    cand_nm          sum_election_tp_p2016 mean_election_tp
##  Length:6           Min.   : 3317092      Min.   :  48.2  
##  Class :character   1st Qu.: 4578916      1st Qu.: 106.8  
##  Mode  :character   Median : 5447892      Median : 147.3  
##                     Mean   :14118570      Mean   : 318.1  
##                     3rd Qu.:16192893      3rd Qu.: 327.6  
##                     Max.   :46434728      Max.   :1085.1  
##        n         
##  Min.   :  3057  
##  1st Qu.: 19745  
##  Median : 47783  
##  Mean   :129225  
##  3rd Qu.:206321  
##  Max.   :407163

Above is a graph of primary election contribution amounts by candidate. Hillary Clinton received the most contributions during the primary election, followed by Bernard Sanders. Note that the $318.1 in the summary is the average of the six candidates’ mean contributions, not the mean of all primary contributions; from the election type table, the mean primary contribution is $115, about the same as the overall mean of $116.
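When reading the `summary()` output of a grouped table, it helps to keep in mind that the "Mean" of a column of per-candidate means weights every candidate equally, regardless of how many contributions each received. The sketch below uses the six sums and counts printed above to compare that figure with a mean pooled over all contributions to the six candidates. The variable names are hypothetical.

```r
# Sums and counts for the six candidates, copied from the printed table.
DF_p2016 <- data.frame(
  cand_nm = c("Clinton", "Sanders", "Cruz", "Rubio", "Trump", "Bush"),
  sum_amt = c(46434728, 19623823, 5900103, 4995681, 4439995, 3317092),
  n       = c(256294, 407163, 56402, 13272, 39164, 3057)
)
DF_p2016$mean_amt <- DF_p2016$sum_amt / DF_p2016$n

# Mean of the six per-candidate means: this is what summary() reports.
mean_of_means <- mean(DF_p2016$mean_amt)

# Mean over all contributions to these six candidates.
pooled_mean <- sum(DF_p2016$sum_amt) / sum(DF_p2016$n)

print(round(c(mean_of_means = mean_of_means, pooled_mean = pooled_mean), 1))
```

The pooled figure is far lower because candidates with many small contributions dominate it, while the mean of means is pulled up by candidates with few, large contributions.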

count_of_occupation <- length(unique(DF$contbr_occupation))
print(count_of_occupation)
## [1] 28616

We see that there are 28,616 unique contributor occupations. With people from so many occupations contributing, it would be difficult to see the spread of contributions across all of them. Let’s pick the top 10 occupation categories.

DF_occu_dist <- DF %>% 
  group_by(contbr_occupation) %>% 
  summarise(sum_occup = sum(contb_receipt_amt, na.rm=TRUE),
                              mean_occu = mean(contb_receipt_amt), 
                              n = n())
DF_occu_dist <- head(arrange(DF_occu_dist,desc(sum_occup)), n = 10)
DF_occu_dist
## # A tibble: 10 x 4
##    contbr_occupation     sum_occup mean_occu      n
##    <chr>                     <dbl>     <dbl>  <int>
##  1 RETIRED                25158880      96.6 260546
##  2 ATTORNEY                8242015     225    36642
##  3 NOT EMPLOYED            6484973      56.1 115598
##  4 INFORMATION REQUESTED   5388307     174    30948
##  5 HOMEMAKER               4800790     278    17268
##  6 CEO                     3421376     477     7174
##  7 PHYSICIAN               2644065     164    16111
##  8 CONSULTANT              2503911     179    13961
##  9 PRESIDENT               2354773     498     4728
## 10 LAWYER                  2157174     241     8945
summary(DF_occu_dist)
##  contbr_occupation    sum_occup          mean_occu           n         
##  Length:10          Min.   : 2157174   Min.   : 56.1   Min.   :  4728  
##  Class :character   1st Qu.: 2538950   1st Qu.:166.6   1st Qu.: 10199  
##  Mode  :character   Median : 4111083   Median :202.1   Median : 16690  
##                     Mean   : 6315626   Mean   :238.9   Mean   : 51192  
##                     3rd Qu.: 6210806   3rd Qu.:268.8   3rd Qu.: 35218  
##                     Max.   :25158880   Max.   :498.0   Max.   :260546

From the summary we see that the occupation categories “ATTORNEY”, “HOMEMAKER”, “CEO”, “PRESIDENT” and “LAWYER” have a mean contribution above the median ($202.1) of the ten occupation means. All categories except “RETIRED” and “NOT EMPLOYED” have a mean above the overall mean contribution of $116.

Among these ten categories, “PRESIDENT” has the highest mean contribution and “NOT EMPLOYED” has the lowest.

This is a striking graph. It shows that retired people contributed the most to the election funds.

Check contributions by city, and then find the top 10 contributing cities. Below is a graph of the top 10 contributing cities.

count_of_city <- length(unique(DF$contbr_city))
print(count_of_city)
## [1] 2534
DF_contbr_city <- DF %>% 
  group_by(contbr_city) %>% 
  summarise(sum_city = sum(contb_receipt_amt, na.rm=TRUE),
                              mean_city = mean(contb_receipt_amt), 
                              n = n())
DF_contbr_city <- head(arrange(DF_contbr_city,desc(sum_city)), n = 10)
DF_contbr_city
## # A tibble: 10 x 4
##    contbr_city   sum_city mean_city      n
##    <chr>            <dbl>     <dbl>  <int>
##  1 LOS ANGELES   16220656     158   102710
##  2 SAN FRANCISCO 15376476     169    90937
##  3 SAN DIEGO      3849797      83.5  46129
##  4 PALO ALTO      3261409     269    12105
##  5 OAKLAND        3150637      94.8  33235
##  6 BEVERLY HILLS  3125763     460     6796
##  7 BERKELEY       2863864     124    23150
##  8 SANTA MONICA   2854454     197    14495
##  9 SAN JOSE       2408418      78.5  30674
## 10 SACRAMENTO     2343078      98.5  23799

Los Angeles is the top contributing city, followed closely by San Francisco. The “amount of contribution” from these cities is high, but is the “count of contributions” also high? Let’s see the number of contributions per candidate, occupation and city.

Group the contribution amount per candidate, per occupation, per city.

options(scipen = 999)
DF_cand_occu_city_grp <- DF %>% 
   # group_by(.dots = ...) is deprecated; name the grouping columns directly
   group_by(cand_nm, contbr_occupation, contbr_city) %>% 
   summarise(sum_cand_occu_city = sum(contb_receipt_amt, na.rm = TRUE),
             n = n())

DF_cand_occu_city_grp <- arrange(DF_cand_occu_city_grp, desc(sum_cand_occu_city))
DF_cand_occu_city_grp
## # A tibble: 118,021 x 5
## # Groups: cand_nm, contbr_occupation [39,202]
##    cand_nm                 contbr_occupation contbr_city   sum_cand…     n
##    <chr>                   <chr>             <chr>             <dbl> <int>
##  1 Clinton, Hillary Rodham RETIRED           SAN FRANCISCO   1172716  7570
##  2 Clinton, Hillary Rodham ATTORNEY          LOS ANGELES     1138550  3948
##  3 Clinton, Hillary Rodham ATTORNEY          SAN FRANCISCO   1045324  3687
##  4 Clinton, Hillary Rodham RETIRED           LOS ANGELES      978573  8480
##  5 Clinton, Hillary Rodham WRITER            LOS ANGELES      475835  2620
##  6 Clinton, Hillary Rodham RETIRED           SAN DIEGO        460634  5875
##  7 Clinton, Hillary Rodham RETIRED           BERKELEY         370685  2626
##  8 Clinton, Hillary Rodham RETIRED           OAKLAND          369695  4469
##  9 Clinton, Hillary Rodham RETIRED           SACRAMENTO       338157  4152
## 10 Clinton, Hillary Rodham HOMEMAKER         LOS ANGELES      334893   789
## # ... with 118,011 more rows
##  [1] "Clinton, Hillary Rodham"   "Sanders, Bernard"         
##  [3] "Trump, Donald J."          "Cruz, Rafael Edward 'Ted'"
##  [5] "Rubio, Marco"              "Bush, Jeb"                
##  [7] "Carson, Benjamin S."       "Kasich, John R."          
##  [9] "Fiorina, Carly"            "Paul, Rand"
##  [1] "RETIRED"               "ATTORNEY"             
##  [3] "NOT EMPLOYED"          "INFORMATION REQUESTED"
##  [5] "HOMEMAKER"             "CEO"                  
##  [7] "PHYSICIAN"             "CONSULTANT"           
##  [9] "PRESIDENT"             "LAWYER"
##  [1] "LOS ANGELES"   "SAN FRANCISCO" "SAN DIEGO"     "PALO ALTO"    
##  [5] "OAKLAND"       "BEVERLY HILLS" "BERKELEY"      "SANTA MONICA" 
##  [9] "SAN JOSE"      "SACRAMENTO"

In the above graph the occupation categories are ordered by the amount they donated. The most notable contrast is between “Attorney” and “Not Employed”: although many more “Not Employed” people donated, their total contribution was less than that of the “Attorney” category.

On the contribution city graph, the cities are likewise arranged in descending order of amount contributed. The most notable point is that the amount contributed by “Palo Alto” is higher than that of “Oakland”, while the number of contributions from “Palo Alto” is much lower than from “Oakland”.

This may suggest that contributors from “Palo Alto” are, on average, wealthier.

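# Note: rank() on a character column orders the values alphabetically, so these
# correlations treat categories as if they were ordinal; read them with caution.
with(DF, cor(contb_receipt_amt, rank(cand_nm)))
with(DF, cor(contb_receipt_amt, rank(contbr_occupation)))
with(DF, cor(contb_receipt_amt, rank(contbr_city)))

with(DF, cor(rank(cand_nm), rank(contbr_occupation)))
with(DF, cor(rank(cand_nm), rank(contbr_city)))
with(DF, cor(rank(contbr_occupation), rank(contbr_city)))

# Correlation between the summed amount and the number of contributions per group
with(DF_cand_occu_city_grp, cor(sum_cand_occu_city, n))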

There doesn’t seem to be much correlation between candidate name and contributor city, or between contributor city and occupation. One pair that shows a strong positive correlation is the sum of the contribution amount and the number of contributions: the higher the “number of contributions” per candidate, per occupation, per city, the higher the total value contributed.

The “Retired” category accounts for 20% of all contributions, that is, almost one fifth of the total.

Let’s check the retired percentage for Los Angeles and San Francisco.

We calculated that roughly 10% of all contributions from Los Angeles are in the “Retired” category. This is half the overall percentage.

Similarly, about 10% of all contributions from San Francisco are in the “Retired” category. Since both cities are below the overall 20%, the “Retired” category must make up an even larger share of contributions in other cities, not only in LA and SF.
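These percentages can be checked from the row counts printed in the tables of this report. The sketch below is one way to do that arithmetic. The counts are numbers of contributions, and the data frame layout is chosen only for this example.

```r
# Contribution counts taken from the tables printed in this report.
retired <- data.frame(
  scope     = c("All of California", "LOS ANGELES", "SAN FRANCISCO"),
  n_retired = c(260546, 10387, 9096),
  n_total   = c(1304346, 102710, 90937)
)
retired$pct_retired <- round(100 * retired$n_retired / retired$n_total, 1)
print(retired)

# Share of "RETIRED" contributions once LA and SF are excluded.
pct_elsewhere <- round(
  100 * (260546 - 10387 - 9096) / (1304346 - 102710 - 90937), 1
)
print(pct_elsewhere)
```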

Bivariate Analysis

For the bivariate analysis I looked at how the total contribution amount is distributed with respect to candidate, political party, election type, contributor occupation and contributor city.

The dataset does not include a political party, so I added a Political_Party column and filled it with the party of each candidate, using http://www.politifact.com/ to look up the 25 unique candidates. I found that the 25 candidates belonged to 4 different political parties, namely the Democratic Party, the Green Party, the Libertarian Party and the Republican Party.
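One way to add such a party column is a small lookup table joined onto the data by candidate name. The sketch below shows the idea for five of the candidates. The lookup table, its labels and the toy amounts are assumptions for illustration and not the mapping actually used in this analysis.

```r
library(dplyr)

# Hypothetical lookup table covering only five of the 25 candidates.
party_lookup <- tibble(
  cand_nm         = c("Clinton, Hillary Rodham", "Sanders, Bernard",
                      "Trump, Donald J.", "Johnson, Gary", "Stein, Jill"),
  Political_Party = c("Democratic", "Democratic",
                      "Republican", "Libertarian", "Green")
)

# Toy contributions (amounts are made up).
DF <- tibble(
  cand_nm           = c("Sanders, Bernard", "Trump, Donald J.",
                        "Stein, Jill", "Clinton, Hillary Rodham"),
  contb_receipt_amt = c(27, 100, 50, 250)
)

# left_join() keeps every contribution row and attaches the party label.
DF <- DF %>% left_join(party_lookup, by = "cand_nm")
DF
```

Any candidate missing from the lookup would get `NA` as the party, which makes gaps in the mapping easy to spot with `sum(is.na(DF$Political_Party))`.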

  • Hillary Clinton received the most contributions, which is also reflected in the party she represented: the Democratic Party received almost 75% of the total contributed amount.
  • Retired people contributed the most.
  • The top contributing cities were Los Angeles and San Francisco.
  • Most of the contributions were made for the primary election.

Multivariate Plot Section

Let’s plot a map to see the location of the top contributing cities in California.

##          lon      lat       contbr_city sum_city mean_amt_per_city      n
## 1  -118.2437 34.05223       LOS ANGELES 16220656         157.92675 102710
## 2  -122.4194 37.77493     SAN FRANCISCO 15376476         169.08933  90937
## 3  -117.1611 32.71574         SAN DIEGO  3849797          83.45719  46129
## 4  -122.1430 37.44188         PALO ALTO  3261409         269.42659  12105
## 5  -122.2711 37.80436           OAKLAND  3150637          94.79876  33235
## 6  -118.4004 34.07362     BEVERLY HILLS  3125763         459.94154   6796
## 7  -122.2585 37.87190          BERKELEY  2863864         123.70903  23150
## 8         NA       NA      SANTA MONICA  2854454         196.92678  14495
## 9  -121.8863 37.33821          SAN JOSE  2408418          78.51659  30674
## 10 -121.4944 38.58157        SACRAMENTO  2343078          98.45277  23799
## 11 -118.1445 34.14778          PASADENA  1649200         128.19274  12865
## 12        NA       NA        MENLO PARK  1520624         284.65444   5342
## 13        NA       NA PACIFIC PALISADES  1490631         324.54418   4593
## 14 -117.9298 33.61888     NEWPORT BEACH  1484437         282.64230   5252
## 15 -119.6982 34.42083     SANTA BARBARA  1480510         123.38609  11999
## 16 -122.1141 37.38522         LOS ALTOS  1321927         300.91663   4393
## 17 -118.1937 33.77005        LONG BEACH  1184770          75.19005  15757
## 18 -117.8265 33.68457            IRVINE  1172455         138.00084   8496
## 19 -118.4514 34.14897      SHERMAN OAKS  1091782         142.75388   7648
## 20 -122.1977 37.46133          ATHERTON  1086700         878.49639   1237

In the above graph, the size of each orange dot indicates the number of contributions.
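The map code is not shown in the report. A minimal sketch of a dot map follows, using coordinates and counts from the table above and dropping cities whose coordinates are missing. It draws only the points, without a base map layer, which is a simplification and not necessarily how the original figure was built.

```r
library(ggplot2)

# A few rows from the printed table, including one without coordinates.
DF_city_map <- data.frame(
  contbr_city = c("LOS ANGELES", "SAN FRANCISCO", "SAN DIEGO",
                  "SACRAMENTO", "SANTA MONICA"),
  lon = c(-118.2437, -122.4194, -117.1611, -121.4944, NA),
  lat = c(34.05223, 37.77493, 32.71574, 38.58157, NA),
  n   = c(102710, 90937, 46129, 23799, 14495)
)

# Cities that could not be geocoded have NA coordinates; drop them.
DF_city_map <- DF_city_map[!is.na(DF_city_map$lon) & !is.na(DF_city_map$lat), ]

# Dot size encodes the number of contributions.
p_map <- ggplot(DF_city_map, aes(x = lon, y = lat)) +
  geom_point(aes(size = n), colour = "orange", alpha = 0.7) +
  geom_text(aes(label = contbr_city), vjust = -1, size = 3) +
  coord_quickmap() +
  labs(x = "Longitude", y = "Latitude", size = "Contributions")
```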

Let’s plot some heat maps to see the multivariate effect on contributions. We will look at contributions to candidates with respect to cities and occupations. So far we know the top contributing occupation overall; the heat maps will show the top contributing occupation per city.

## # A tibble: 9,742 x 4
## # Groups: cand_nm [25]
##    cand_nm                 contbr_city   sum_cand_city     n
##    <chr>                   <chr>                 <dbl> <int>
##  1 Clinton, Hillary Rodham SAN FRANCISCO      12220950 56833
##  2 Clinton, Hillary Rodham LOS ANGELES        11997666 65166
##  3 Clinton, Hillary Rodham PALO ALTO           2603319  8344
##  4 Clinton, Hillary Rodham OAKLAND             2276655 19587
##  5 Clinton, Hillary Rodham BEVERLY HILLS       2190303  4576
##  6 Clinton, Hillary Rodham BERKELEY            2143883 12061
##  7 Clinton, Hillary Rodham SANTA MONICA        2105459  9031
##  8 Clinton, Hillary Rodham SAN DIEGO           2070972 24283
##  9 Sanders, Bernard        SAN FRANCISCO       1814651 31078
## 10 Clinton, Hillary Rodham SACRAMENTO          1618197 14036
## # ... with 9,732 more rows

## # A tibble: 39,202 x 4
## # Groups: cand_nm [25]
##    cand_nm                 contbr_occupation     sum_cand_occu      n
##    <chr>                   <chr>                         <dbl>  <int>
##  1 Clinton, Hillary Rodham RETIRED                    14238969 161257
##  2 Clinton, Hillary Rodham ATTORNEY                    6727956  27956
##  3 Sanders, Bernard        NOT EMPLOYED                5286576 105655
##  4 Trump, Donald J.        RETIRED                     4449440  34257
##  5 Clinton, Hillary Rodham INFORMATION REQUESTED       3050159  13897
##  6 Clinton, Hillary Rodham HOMEMAKER                   2877721  12148
##  7 Clinton, Hillary Rodham CEO                         2150276   4111
##  8 Clinton, Hillary Rodham CONSULTANT                  2015791   9402
##  9 Clinton, Hillary Rodham LAWYER                      1801739   6917
## 10 Clinton, Hillary Rodham PHYSICIAN                   1795537   9917
## # ... with 39,192 more rows

## # A tibble: 92,238 x 4
## # Groups: contbr_occupation [28,616]
##    contbr_occupation contbr_city   sum_cand_occu     n
##    <chr>             <chr>                 <dbl> <int>
##  1 RETIRED           SAN FRANCISCO       1476819  9096
##  2 ATTORNEY          LOS ANGELES         1366596  4800
##  3 RETIRED           LOS ANGELES         1350867 10387
##  4 ATTORNEY          SAN FRANCISCO       1135922  4350
##  5 RETIRED           SAN DIEGO            787443  9224
##  6 HOMEMAKER         LOS ANGELES          547853   973
##  7 WRITER            LOS ANGELES          541229  3391
##  8 NOT EMPLOYED      SAN FRANCISCO        495329  5763
##  9 RETIRED           SACRAMENTO           478660  5816
## 10 SOFTWARE ENGINEER SAN FRANCISCO        466443  3658
## # ... with 92,228 more rows

The above heat maps show Hillary Clinton receiving a large number of contributions from Los Angeles and San Francisco. Only the top two candidates have more than 2,000 contributions from most of the cities; the rest of the candidates have fewer than 2,000 contributions across all cities.

On the occupation heat map, Hillary Clinton also received contributions across all occupations.

The “Not Employed” category contributed mostly to Bernard Sanders.
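The heat map code is not part of the report. A heat map of this kind can be sketched with `geom_tile()`, as below, using a few candidate and city counts from the tables above. The colour scale is an arbitrary choice, and combinations that are absent from the data simply appear as empty cells.

```r
library(ggplot2)

# Contribution counts for a few candidate/city pairs from the printed table.
DF_cand_city <- data.frame(
  cand_nm     = c("Clinton, Hillary Rodham", "Clinton, Hillary Rodham",
                  "Clinton, Hillary Rodham", "Sanders, Bernard"),
  contbr_city = c("SAN FRANCISCO", "LOS ANGELES", "SAN DIEGO", "SAN FRANCISCO"),
  n           = c(56833, 65166, 24283, 31078)
)

# One tile per candidate/city pair; fill encodes the number of contributions.
p_heat <- ggplot(DF_cand_city,
                 aes(x = contbr_city, y = cand_nm, fill = n)) +
  geom_tile(colour = "white") +
  scale_fill_gradient(low = "lightyellow", high = "darkred") +
  labs(x = "Contributor city", y = "Candidate", fill = "Contributions")
```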

Multivariate Analysis

We know that Hillary Clinton raised the most money and had the most supporters in California. But was this true throughout the campaign? Looking at the above two graphs, we can notice a few things.

  • Hillary Clinton had the highest number of contributions throughout.
  • The number of contributions for Bernard Sanders rose quite consistently.
  • The number of contributions for Donald Trump fell towards the end of the campaign.
  • Towards the end, only Bernard Sanders came close to Hillary Clinton in terms of number of contributions.
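A trend of this kind can be built by counting contributions per candidate per month and accumulating the counts over time. The sketch below assumes a `yyyymm` month column stored as text and uses made-up candidates and rows. Whether the original graphs show monthly or cumulative counts is not stated, so the cumulative view here is an assumption.

```r
library(dplyr)
library(ggplot2)

# Toy data: one row per contribution; candidates are placeholders.
DF <- tibble(
  cand_nm = c("Candidate A", "Candidate A", "Candidate B",
              "Candidate A", "Candidate B"),
  yyyymm  = c("201601", "201602", "201601", "201602", "201603")
)

DF_trend <- DF %>%
  count(cand_nm, yyyymm, name = "n_contb") %>%  # contributions per month
  group_by(cand_nm) %>%
  arrange(yyyymm, .by_group = TRUE) %>%         # chronological within candidate
  mutate(cum_contb = cumsum(n_contb)) %>%       # running total
  ungroup()

p_trend <- ggplot(DF_trend,
                  aes(x = yyyymm, y = cum_contb,
                      colour = cand_nm, group = cand_nm)) +
  geom_line() +
  geom_point() +
  labs(x = "Month", y = "Cumulative number of contributions",
       colour = "Candidate")
```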

Final Plots and Summary

Plot One

Plot One description

This graph shows the count of contributions for each range of amounts. A large number of people made small donations of between 0 and 250 dollars. Many contributions were made for amounts of 500, 1,000 and 2,700 dollars. From the summary we saw that the mean contribution amount is 116 dollars, which is consistent with the graph.

Plot Two

Plot Two description

Hillary Clinton was the top candidate in terms of contributions received. Her share of contributions was the highest from the beginning of the primary election. This graph answers the question I posed at the beginning of my exploration.

Plot Three

Plot Three description

This was one of the most interesting graphs. Retired people contributed the most to the 2016 election. We also saw that Los Angeles was the top contributing city. Does this mean that most retired contributors live in LA? This may not be a direct relationship, but it is something that can be explored. Another question worth exploring: did Hillary Clinton receive most of her contributions from the “Retired” category?

Reflection

This was a large dataset with more than a million and a quarter observations, containing details of the contributions made to political candidates during the 2016 US presidential election.

I was most interested to see which political party received the most funds. There was no political party column in the given dataset. I first found the unique candidate names and then looked up their parties, to finally draw the pie chart of funding by party. To see the trend of contributions as a time series, I added two columns, Month_Yr and yyyymm.

The dataset was not perfectly clean. I was getting a parsing error while creating the data frame. I found that there was an extra comma character at the end of the last column, which I removed so that the CSV file could be parsed. There were also 7 records where the zip code was not an integer, for example N4W2T. I found that this zip code belongs to Canada, not California, USA. In such records the zip code was replaced by the value ‘000000000’.
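The zip code replacement could be done along the following lines. This is a sketch on a toy character vector. The rule "anything that is not all digits is replaced" is an assumption about how the non-integer codes were identified, and missing values are left untouched.

```r
# Toy zip codes: two valid, one Canadian-style code, one missing.
zip_raw <- c("910162321", "N4W2T", "945981502", NA)

# TRUE when the value consists of digits only (FALSE for NA).
is_numeric_zip <- grepl("^[0-9]+$", zip_raw)

# Replace non-numeric codes with the placeholder; keep NA as NA.
zip_clean <- ifelse(is.na(zip_raw) | is_numeric_zip, zip_raw, "000000000")
print(zip_clean)
```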

The most difficult decision for me was how to handle the negative amount values. I did not want to ignore them initially, but after the entire exploration I realize that ignoring those values would probably have been the better choice. According to the description, this is contribution money to be refunded, so it may never have reached the candidate or party at all. It was marked as negative so that it would effectively be ignored.

For future exploration, I would like to see the number of contributions and their respective contributors for large contribution amounts, above a certain average amount.

I could see the total number of contributors per candidate, and would like to see the number of contributors per candidate, per city, per occupation in one graph. I couldn’t achieve more than two groupings in a single graph.

During the exploration I realized that most of the data in this dataset is categorical; the only continuous variable is “Contribution Amount”. Most of the other variables, such as candidate name, contributor occupation, contributor city and election type, are discrete. So I have mostly plotted bar charts rather than scatter plots or line charts.